7 - Differential Gene Expression & Level 1 annotation

Author

CDN team

Published

May 23, 2024

Introduction

In this notebook we are going to look at how to interpret and visualize gene-level statistics obtained from differential expression analysis. We are not going to go into which method should be used to carry out differential gene expression analysis but we highly recommend giving a read to A comparison of marker gene selection methods for single-cell RNA sequencing data by Jeffrey M. Pullin & Davis J. McCarthy if you’re interested in digging deeper!

Some other interesting papers and twitter discussions can be found here:

Key Takeaways

  • To annotate our clusters we need to determine which genes are differentially expressed in each one.

  • We can quantify these differentially expressed genes using effect size and discriminatory power metrics such as log2FC and AUC.

  • Differential gene expression metrics vary depending on the groups of cells we are comparing.

  • P values obtained from carrying out DGE analysis between clusters are inflated and should not be used.

Libraries

### Make sure all the packages are installed
if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages("Seurat")

if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages("tidyverse")

if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages("colorBlindness")

if (!requireNamespace("RColorBrewer", quietly = TRUE))
    install.packages("RColorBrewer")

if (!requireNamespace("DT", quietly = TRUE))
    install.packages("DT")

if (!requireNamespace("presto", quietly = TRUE))
    devtools::install_github("immunogenomics/presto")


### Load all the necessary libraries
library(Seurat)
library(tidyverse)
library(colorBlindness)
library(RColorBrewer)
library(DT)

Load data

We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from cellxgene portal.

# Download the data in data/ directory
# download.file(
#     url = "https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds",
#     destfile = "../data/workshop-data.rds",
#     method = "wget",
#     extra = "-r -p --random-wait")
# We can also use the CLI with the wget command below
# wget https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds

se <- readRDS("../data/d8e35450-de43-451a-9979-276eac688bce.rds")

Generate a color palette for our cell types

# https://www.datanovia.com/en/blog/easy-way-to-expand-color-palettes-in-r/
# nb.cols <- length(unique(se$Celltype))
# mycolors <- colorRampPalette(paletteMartin)(nb.cols)
pal <- paletteMartin
names(pal) <- sort(unique(se$Celltype))

Analysis

Convert ENSEMBL IDs to Gene Symbols

Right away we can see how ensembl ids are used in the rownames. Let’s transform them into their matched symbols to make them human-readable:

head(rownames(se))
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" "ENSG00000000938"

Convert to gene symbols

gene_df <- readr::read_csv(file = "../data/cov_flu_gene_names_table.csv")

symbol_id <- data.frame(index = rownames(se)) %>%
    left_join(gene_df, by = "index") %>%
    pull(feature_name)

# re-create seurat object
mtx <- se@assays$RNA$data
rownames(mtx) <- symbol_id
se <- CreateSeuratObject(counts = mtx, meta.data = se@meta.data)
rm(mtx); gc()
            used   (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
Ncells   3715605  198.5    5550351  296.5         NA   5550351  296.5
Vcells 180966095 1380.7  781825193 5964.9      98304 818857589 6247.4

Quick processing

se <- se %>%
    NormalizeData(verbose = FALSE) %>%
    FindVariableFeatures(
        method = "vst",
        nfeatures = 3000,
        verbose = FALSE) %>%
    ScaleData(verbose = FALSE, features = VariableFeatures(.)) %>%
    RunPCA(verbose = FALSE)

ElbowPlot(se, ndims = 50)

Let’s run FindNeighbors and FindClusters to label our data:

se <- FindNeighbors(se, reduction = "pca") %>% 
    FindClusters(resolution = c(0.01, 0.05, 0.1, 0.25))
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 1894745

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9956
Number of communities: 4
Elapsed time: 9 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 1894745

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9829
Number of communities: 7
Elapsed time: 10 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 1894745

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9721
Number of communities: 9
Elapsed time: 10 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 1894745

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9532
Number of communities: 15
Elapsed time: 11 seconds

Let’s compute the UMAP

se <- se %>%
    RunUMAP(dims = 1:30, verbose = FALSE)

We can now take a look on the UMAP space how the clusters look

se$sample_id <- se$`Sample ID`
DimPlot(
    se,
    group.by = c(
        "RNA_snn_res.0.01", "RNA_snn_res.0.05",
        "RNA_snn_res.0.1", "RNA_snn_res.0.25"),
    label = TRUE)

dim_plt <- DimPlot(
    se,
    group.by = c("RNA_snn_res.0.05"),
    label = TRUE)

And the original cell type labels + the sample IDs

DimPlot(
    se,
    group.by = c("Celltype", "sample_id"),
    label = FALSE)

For the purpose of this tutorial we’re going to go forward with resolution 0.05!

DGE Wilcoxon

The different implementations Seurat incorporates provides in FindAllMarkers compare the gene expression between 2 groups of cells. This one vs all strategy is very quick and returns the avg_log2FC. This avg_log2FC is computed as detailed here & here. Since we’re working with normalized data the log2FC can be directly computed by subtracting the average expression between both groups - log(\frac{exp1}{exp2})=log(exp1)-log(exp2)

Idents(se) <- "RNA_snn_res.0.05"
mgs <- FindAllMarkers(
    se,
    test.use = "wilcox",
    slot = "data",
    only.pos = TRUE,
    logfc.threshold = 0.5,
    min.pct = 0.25)

Look at the results in a dynamic table:

DT::datatable(mgs, filter = "top")

Look at the results in a heatmap

top10 <- mgs %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1) %>%
    slice_head(n = 10) %>%
    ungroup()

DoHeatmap(se, features = top10$gene) + NoLegend()

Annotation

Cluster 0 & 4

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("CD3D", "CD3D", "TRAC", "TRBC2", "CD8B", "CD4")) +
    dim_plt

VlnPlot(
    se,
    features = c("CD3D", "CD3D", "TRAC", "TRBC2", "CD8B", "CD4"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Clusters 0 & 4 seem to have a lot of expression of T cell related genes so at this level 1 we are going to label them as T cells.

Cluster 1

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("CD14", "S100A8", "VCAN", "LYZ", "MS4A6A")) +
    dim_plt

VlnPlot(
    se,
    features = c("CD14", "S100A8", "VCAN", "LYZ", "MS4A6A"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Cluster 1 is expressing a lot of monocyte-like genes, at this level 1 we are going to label them as monocytes.

Cluster 2

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("MS4A1", "CD79A", "CD79B", "IGHD", "IGHM")) +
    dim_plt

VlnPlot(
    se,
    features = c("MS4A1", "CD79A", "CD79B", "IGHD", "IGHM"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Cluster 2 is expressing B cells genes

Cluster 3

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("PF4", "GP9", "PPBP")) +
    dim_plt

VlnPlot(
    se,
    features = c("PF4", "GP9", "PPBP"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Cluster 3 is expressing platelet genes

Cluster 5

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("HBA1", "HBA2", "HBB", "HBD")) +
    dim_plt

VlnPlot(
    se,
    features = c("HBA1", "HBA2", "HBB", "HBD"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Cluster 5 is expressing hemoglobin genes so they are likely RBC

Cluster 6

Let’s look at genes that are differentially expressed

FeaturePlot(
    se,
    features = c("SHD", "SHC", "LILRA4", "CLEC4C", "IL3RA", "IRF4")) +
    dim_plt

VlnPlot(
    se,
    features = c("HBA1", "HBA2", "HBB", "HBD"),
    group.by = "RNA_snn_res.0.05") +
    dim_plt

Cluster 6 is expressing genes predominantly expressed by pDCs.

Annotate

According to the markers observed we can make a first general annotation

se@meta.data <- se@meta.data %>%
  dplyr::mutate(
    annotation_lvl1 = dplyr::case_when(
      RNA_snn_res.0.05 == 0 ~ "T cells",
      RNA_snn_res.0.05 == 1 ~ "Monocytes", #
      RNA_snn_res.0.05 == 2 ~ "B cells",
      RNA_snn_res.0.05 == 3 ~ "Platelets",
      RNA_snn_res.0.05 == 4 ~ "T cells", # 
      RNA_snn_res.0.05 == 5 ~ "RBC",
      RNA_snn_res.0.05 == 6 ~ "pDCs")
  )

DimPlot(se, group.by = "annotation_lvl1")

DotPlot summary

order <- c("T cells", "Monocytes", "B cells", "Platelets", "RBC", "pDCs")

se$annotation_lvl1_ord <- factor(
  x = se$annotation_lvl1,
  levels = order)

## Genes for DOTPLOT
dplot_genes <- c(
    # T cell genes
    "CD3D", "CD3E", "TRAC", "TRBC2", "CD8B", "CD4",
    # Monocytes
    "CD14", "S100A8", "VCAN", "LYZ", "MS4A6A",
    # B cells
    "MS4A1", "CD79A", "CD79B", "IGHD", "IGHM",
    # Platelets
    "PF4", "GP9", "PPBP",
    # RBS
    "HBA1", "HBA2", "HBB", "HBD",
    #pDCs
    "LILRA4", "CLEC4C", "IL3RA", "IRF4"
  )

Seurat::DotPlot(
  object = se,
  features = dplot_genes,
  group.by = "annotation_lvl1_ord",
  col.min = 0,
  dot.min = 0) +
  ggplot2::scale_x_discrete(
    breaks = dplot_genes) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 60, hjust = 1)) +
  ggplot2::labs(x = "", y = "")

Extra

See below how the avg_log2FC calculation is done! Code extracted from Seurat’s codebase.

features <- rownames(se) == "MS4A1"
cells.1 <- se$Celltype == "B cell, IgG+"
cells.2 <- se$Celltype != "B cell, IgG+"
data.use <- GetAssayData(object = se, assay.type = "RNA", slot = "data")
pseudocount.use <- 1
base <- 2

# Calculate fold change
mean.fxn <- function(x) {
    return(log(x = (rowSums(x = expm1(x = x)) + pseudocount.use)/NCOL(x), base = base))
  }

data.1 <- mean.fxn(data.use[features, cells.1, drop = FALSE])
data.2 <- mean.fxn(data.use[features, cells.2, drop = FALSE])

# Look at log2FC
(fc <- (data.1 - data.2))

Check if its equal to the avg_log2FC obtained from FindAllMarkers:

fc == mgs[mgs$cluster == "B cell, IgG+" & mgs$gene == "MS4A1", "avg_log2FC"]
Looking into the P-values

More details can be obtained in OSCA.

P values obtained from DGE analysis are inflated and, therefore invalid in their interpretation. We can’t use p-values to reject the Null Hypothesis since we are carrying out data snooping. This means that we are dividing the clusters based on their gene expression, and then computing p-values from the genes that are differentially expressed, even though we already know these genes are differentially expressed since we clustered the data based on them being different.

A way to show this is by looking at how skewed the distributions of the p-values obtained is:

# Compute the p-values without he thresholds
mgs2 <- FindAllMarkers(
    se,
    test.use = "wilcox",
    only.pos = TRUE,
    logfc.threshold = 0,
    min.pct = 0,
    return.thresh = 1,
    max.cells.per.ident = 100 # use 100 cells per cell type for speed
    )

ggplot(mgs2, aes(x = p_val, fill = cluster, color = cluster)) +
    # geom_histogram(alpha = 0.3, position = "identity") +
    geom_density(alpha = 0.3) +
    theme_minimal()

ggplot(mgs2, aes(x = p_val, fill = cluster, color = cluster)) +
    geom_histogram(alpha = 0.3, position = "identity") +
    facet_wrap(~cluster, scales = "free") +
    theme_minimal()

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DT_0.33              RColorBrewer_1.1-3   colorBlindness_0.1.9 lubridate_1.9.2      forcats_1.0.0        stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2          readr_2.1.4          tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.0        tidyverse_2.0.0      Seurat_5.0.3         SeuratObject_5.0.2   sp_2.1-3            

loaded via a namespace (and not attached):
  [1] rstudioapi_0.15.0      jsonlite_1.8.8         magrittr_2.0.3         ggbeeswarm_0.7.2       spatstat.utils_3.0-4   farver_2.1.1           rmarkdown_2.26         vctrs_0.6.5            ROCR_1.0-11            spatstat.explore_3.2-6 htmltools_0.5.7        gridGraphics_0.5-1     sass_0.4.8             sctransform_0.4.1      parallelly_1.37.0      bslib_0.6.1            KernSmooth_2.23-22     htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9             cachem_1.0.8           plotly_4.10.4          zoo_1.8-12             igraph_2.0.2           mime_0.12              lifecycle_1.0.4        pkgconfig_2.0.3        Matrix_1.6-5           R6_2.5.1               fastmap_1.1.1          fitdistrplus_1.1-11    future_1.33.1          shiny_1.8.0            digest_0.6.34          colorspace_2.1-0       patchwork_1.2.0        tensor_1.5             RSpectra_0.16-1        irlba_2.3.5.1          crosstalk_1.2.1        labeling_0.4.3         progressr_0.14.0       fansi_1.0.6            spatstat.sparse_3.0-3  timechange_0.2.0       httr_1.4.7             polyclip_1.10-6        abind_1.4-5            compiler_4.3.1         bit64_4.0.5            withr_3.0.0           
 [52] fastDummies_1.7.3      MASS_7.3-60            tools_4.3.1            vipor_0.4.5            lmtest_0.9-40          beeswarm_0.4.0         httpuv_1.6.14          future.apply_1.11.1    goftest_1.2-3          glue_1.7.0             nlme_3.1-163           promises_1.2.1         grid_4.3.1             Rtsne_0.17             cluster_2.1.4          reshape2_1.4.4         generics_0.1.3         gtable_0.3.4           spatstat.data_3.0-4    tzdb_0.4.0             data.table_1.15.0      hms_1.1.3              utf8_1.2.4             spatstat.geom_3.2-8    RcppAnnoy_0.0.22       ggrepel_0.9.5          RANN_2.6.1             pillar_1.9.0           limma_3.56.2           vroom_1.6.3            spam_2.10-0            RcppHNSW_0.6.0         later_1.3.2            splines_4.3.1          lattice_0.21-8         bit_4.0.5              survival_3.5-7         deldir_2.0-2           tidyselect_1.2.0       miniUI_0.1.1.1         pbapply_1.7-2          knitr_1.45             gridExtra_2.3          scattermore_1.2        xfun_0.42              matrixStats_1.2.0      stringi_1.8.3          lazyeval_0.2.2         yaml_2.3.8             evaluate_0.23          codetools_0.2-19      
[103] cli_3.6.2              uwot_0.1.16            xtable_1.8-4           reticulate_1.35.0.9000 jquerylib_0.1.4        munsell_0.5.0          Rcpp_1.0.12            globals_0.16.2         spatstat.random_3.2-2  png_0.1-8              ggrastr_1.0.2          parallel_4.3.1         ellipsis_0.3.2         presto_1.0.0           dotCall64_1.1-1        listenv_0.9.1          viridisLite_0.4.2      scales_1.3.0           ggridges_0.5.6         crayon_1.5.2           leiden_0.4.3.1         rlang_1.1.3            cowplot_1.1.3